Script Identification from Printed Document Images Using Statistical Features
Author
Abstract
Automatic identification of the script in a document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and selecting a script-specific OCR engine in a multilingual environment. In this work, a technique for script identification from document images is proposed. The method uses the vertical and horizontal run components (objects) of the words in a single line of text to distinguish three scripts used in India: Kannada, Hindi and English. First, the method segments words from a selected line of text in the document image. Then, statistics of the horizontal and vertical run objects are computed. Finally, a linear discriminant function is used to identify the script of the document image as Kannada, Hindi or English. The method has been tested on 300 document images and was found to be robust and efficient. The proposed system achieves an identification accuracy of 93% for Hindi script, 90% for English script and 86% for Kannada script.
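The run statistics mentioned in the abstract can be illustrated with a minimal sketch. It assumes a segmented word is available as a binary NumPy array (1 = ink, 0 = background) and computes the mean and maximum length of horizontal and vertical ink runs. The function names `run_lengths` and `run_statistics`, and the choice of mean and maximum as statistics, are illustrative assumptions; the abstract does not specify which statistics the authors actually use.

```python
import numpy as np

def run_lengths(line):
    """Return the lengths of consecutive runs of 1s in a 1-D binary sequence."""
    # Pad with zeros so runs touching either border are still delimited.
    padded = np.concatenate(([0], np.asarray(line, dtype=np.int64), [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(diff == -1)    # 1 -> 0 transitions
    return ends - starts

def run_statistics(word):
    """Mean and max ink-run length along rows and columns of a binary word image.

    The feature set here is an assumption for illustration only.
    `word` must be a non-empty 2-D array.
    """
    word = np.asarray(word)
    h_runs = np.concatenate([run_lengths(row) for row in word])
    v_runs = np.concatenate([run_lengths(col) for col in word.T])

    def summarize(runs):
        if runs.size == 0:               # blank image: no ink runs at all
            return 0.0, 0
        return float(runs.mean()), int(runs.max())

    h_mean, h_max = summarize(h_runs)
    v_mean, v_max = summarize(v_runs)
    return {"h_mean": h_mean, "h_max": h_max, "v_mean": v_mean, "v_max": v_max}

# Toy "word": a long top stroke with one vertical stem beneath it.
toy_word = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
features = run_statistics(toy_word)
```

A feature vector of this kind, gathered over the words of one text line, is the sort of input a linear discriminant function could use to separate the three script classes. The specific features and discriminant used in the paper may differ from this sketch.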
Similar resources
Global Approach for Script Identification using Wavelet Packet Based Features
In a multi-script environment, archives of documents with text regions printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documen...
Handwritten Script Identification from a Bi-Script Document at Line Level using Gabor Filters
In a country like India, where a large number of scripts are in use, automatic identification of printed and handwritten script facilitates many important applications, including sorting of document images and searching online archives of document images. In this paper, a Gabor feature based approach is presented to identify different Indian scripts from handwritten document images. Eight popular In...
Wavelet Packet Based Texture Features for Automatic Script Identification
In a multi-script environment, archives of documents printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...
Entropy Based Texture Features Useful for Automatic Script Identification
In a multi-script environment, collections of documents printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition, it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the documents printed in three prioritized scripts Kannada, Hi...
Script Identification of Text Words from a Tri Lingual Document Using Voting Technique
In a multi-script environment, the majority of documents may contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper proposes to develop a model to identify and separate text words of Kannada, H...